Multilingual Named-Entity Recognition from Parallel Corpora
Abstract
We present a named-entity recognition (NER) system for parallel multilingual text. The system handles three languages (English, French, and Spanish) and is tailored to the biomedical domain. For each language, we design a supervised, knowledge-based CRF model with rich biomedical and general-domain information. To transfer the predictions made by each individual language model to the other parallel languages, we use the sentence alignment of the parallel corpora, the word alignment generated by the GIZA++ tool [8], and Wikipedia-based word alignment. We then re-train each individual language system on the transferred predictions to produce a final enriched NER model for each language. Thanks to the predictions transferred from the other language systems, the enriched system performs better than the initial system. Each language model also benefits from the external knowledge extracted from biomedical and general-domain resources.
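The transfer step described in the abstract can be illustrated with a small sketch of label projection over a word alignment. This is not the authors' implementation: the function name `project_labels`, the label names (`CHEM`, `DISO`), the toy sentences, and the majority-vote rule are all assumptions made for illustration, and the alignment is assumed to be already parsed into 0-based (source index, target index) pairs rather than read from GIZA++ output.

```python
from collections import Counter


def project_labels(src_labels, alignment, tgt_len, outside="O"):
    """Project per-token entity labels from a source sentence onto a target sentence.

    src_labels: one label per source token (hypothetical tag set).
    alignment:  iterable of (src_index, tgt_index) pairs, 0-based.
    tgt_len:    number of tokens in the target sentence.
    Unaligned target tokens receive the `outside` label.
    """
    votes = [Counter() for _ in range(tgt_len)]
    for s, t in alignment:
        # Ignore alignment links that point outside either sentence.
        if 0 <= s < len(src_labels) and 0 <= t < tgt_len:
            label = src_labels[s]
            if label != outside:
                votes[t][label] += 1
    # Each target token takes the entity label most often aligned to it.
    return [v.most_common(1)[0][0] if v else outside for v in votes]


# Toy parallel sentence pair (invented for illustration).
en_tokens = ["Aspirin", "reduces", "fever"]
en_labels = ["CHEM", "O", "DISO"]
fr_tokens = ["L'aspirine", "réduit", "la", "fièvre"]
alignment = [(0, 0), (1, 1), (2, 3)]

fr_labels = project_labels(en_labels, alignment, len(fr_tokens))
print(list(zip(fr_tokens, fr_labels)))
```

In a pipeline like the one the abstract outlines, labels projected this way would serve as additional training signal when each language model is re-trained; how the projected labels are weighted against the original annotations is not stated in the abstract.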
Similar Resources
Cross-lingual Transfer of Named Entity Recognizers without Parallel Corpora
We propose an approach to cross-lingual named entity recognition model transfer without the use of parallel corpora. In addition to global de-lexicalized features, we introduce multilingual gazetteers that are generated using graph propagation, and cross-lingual word representation mappings without the use of parallel data. We target the e-commerce domain, which is challenging due to its unstru...
Collaboratively Annotating Multilingual Parallel Corpora in the Biomedical Domain―some MANTRAs
The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages—an observation which holds for seemingly well-resourced, yet still dramatically low-resourced ones such as Spanish, French or German but even more so for really under-resourced ones such as Dutch. We here present experimental results for automatically annotating parallel corp...
POLYGLOT-NER: Massive Multilingual Named Entity Recognition
The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named ...
Learning Formulation and Transformation Rules for Multilingual Named Entities
This paper investigates three multilingual named entity corpora, including named people, named locations and named organizations. Frequency-based approaches with and without dictionary are proposed to extract formulation rules of named entities for individual languages, and transformation rules for mapping among languages. We consider the issues of abbreviation and compound keyword at a distance.
Construction of a Vietnamese Corpora for Named Entity Recognition
In order to build an automatic named entity recognition (NER) system using a machine learning approach, a large tagged corpus is widely seen as one necessary knowledge resource. Nevertheless, manual construction is time consuming, labor intensive and expensive. Building NER corpora for European languages has been extensively studied while some less-studied languages such as Vietnamese have not ...